mitre corporation
Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting
Meng, Yuqiao, Tang, Luoxi, Yu, Feiyang, Li, Xi, Yan, Guanhua, Yang, Ping, Xi, Zhaohan
As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat analysis. However, their effectiveness in real-world blue team threat-hunting scenarios remains insufficiently explored. This paper presents CyberTeam, a benchmark designed to guide LLMs in blue teaming practice. CyberTeam constructs a standardized workflow in two stages. First, it models realistic threat-hunting workflows by capturing the dependencies among analytical tasks from threat attribution to incident response. Next, each task is addressed through a set of operational modules tailored to its specific analytical requirements. This transforms threat hunting into a structured sequence of reasoning steps, with each step grounded in a discrete operation and ordered according to task-specific dependencies. Guided by this framework, LLMs are directed to perform threat-hunting tasks through modularized steps. Overall, CyberTeam integrates 30 tasks and 9 operational modules to guide LLMs through standardized threat analysis. We evaluate both leading LLMs and state-of-the-art cybersecurity agents, comparing CyberTeam against open-ended reasoning strategies. Our results highlight the improvements enabled by standardized design, while also revealing the limitations of open-ended reasoning in real-world threat hunting.
Adversarial Threat Vectors and Risk Mitigation for Retrieval-Augmented Generation Systems
Ward, Chris M., Harguess, Josh
Retrieval-Augmented Generation (RAG) systems, which integrate Large Language Models (LLMs) with external knowledge sources, are vulnerable to a range of adversarial attack vectors. This paper examines the importance of RAG systems through recent industry adoption trends and identifies the prominent attack vectors for RAG: prompt injection, data poisoning, and adversarial query manipulation. We analyze these threats under risk management lens, and propose robust prioritized control list that includes risk-mitigating actions like input validation, adversarial training, and real-time monitoring.
LLM-Assisted Proactive Threat Intelligence for Automated Reasoning
Paul, Shuva, Alemi, Farhad, Macwan, Richard
Successful defense against dynamically evolving cyber threats requires advanced and sophisticated techniques. This research presents a novel approach to enhance real-time cybersecurity threat detection and response by integrating large language models (LLMs) and Retrieval-Augmented Generation (RAG) systems with continuous threat intelligence feeds. Leveraging recent advancements in LLMs, specifically GPT-4o, and the innovative application of RAG techniques, our approach addresses the limitations of traditional static threat analysis by incorporating dynamic, real-time data sources. We leveraged RAG to get the latest information in real-time for threat intelligence, which is not possible in the existing GPT-4o model. We employ the Patrowl framework to automate the retrieval of diverse cybersecurity threat intelligence feeds, including Common Vulnerabilities and Exposures (CVE), Common Weakness Enumeration (CWE), Exploit Prediction Scoring System (EPSS), and Known Exploited Vulnerabilities (KEV) databases, and integrate these with the all-mpnet-base-v2 model for high-dimensional vector embeddings, stored and queried in Milvus. We demonstrate our system's efficacy through a series of case studies, revealing significant improvements in addressing recently disclosed vulnerabilities, KEVs, and high-EPSS-score CVEs compared to the baseline GPT-4o. This work not only advances the role of LLMs in cybersecurity but also establishes a robust foundation for the development of automated intelligent cyberthreat information management systems, addressing crucial gaps in current cybersecurity practices.
Safeguard is a Double-edged Sword: Denial-of-service Attack on Large Language Models
Zhang, Qingzhao, Xiong, Ziyang, Mao, Z. Morley
Safety is a paramount concern of large language models (LLMs) in their open deployment. To this end, safeguard methods aim to enforce the ethical and responsible use of LLMs through safety alignment or guardrail mechanisms. However, we found that the malicious attackers could exploit false positives of safeguards, i.e., fooling the safeguard model to block safe content mistakenly, leading to a new denial-of-service (DoS) attack on LLMs. Specifically, by software or phishing attacks on user client software, attackers insert a short, seemingly innocuous adversarial prompt into to user prompt templates in configuration files; thus, this prompt appears in final user requests without visibility in the user interface and is not trivial to identify. By designing an optimization process that utilizes gradient and attention information, our attack can automatically generate seemingly safe adversarial prompts, approximately only 30 characters long, that universally block over 97\% of user requests on Llama Guard 3. The attack presents a new dimension of evaluating LLM safeguards focusing on false positives, fundamentally different from the classic jailbreak.
Harnessing AI for efficient analysis of complex policy documents: a case study of Executive Order 14110
Kramer, Mark A., Leavens, Allen, Scarlat, Alexander
Policy documents, such as legislation, regulations, and executive orders, are crucial in shaping society. However, their length and complexity make interpretation and application challenging and time-consuming. Artificial intelligence (AI), particularly large language models (LLMs), has the potential to automate the process of analyzing these documents, improving accuracy and efficiency. This study aims to evaluate the potential of AI in streamlining policy analysis and to identify the strengths and limitations of current AI approaches. The research focuses on question answering and tasks involving content extraction from policy documents. A case study was conducted using Executive Order 14110 on "Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence" as a test case. Four commercial AI systems were used to analyze the document and answer a set of representative policy questions. The performance of the AI systems was compared to manual analysis conducted by human experts. The study found that two AI systems, Gemini 1.5 Pro and Claude 3 Opus, demonstrated significant potential for supporting policy analysis, providing accurate and reliable information extraction from complex documents. They performed comparably to human analysts but with significantly higher efficiency. However, achieving reproducibility remains a challenge, necessitating further research and development.
Synthetic Medical Imaging Generation with Generative Adversarial Networks For Plain Radiographs
McNulty, John R., Kho, Lee, Case, Alexandria L., Fornaca, Charlie, Johnston, Drew, Slater, David, Abzug, Joshua M., Russell, Sybil A.
In medical imaging, access to data is commonly limited due to patient privacy restrictions and the issue that it can be difficult to acquire enough data in the case of rare diseases.[1] The purpose of this investigation was to develop a reusable open-source synthetic image generation pipeline, the GAN Image Synthesis Tool (GIST), that is easy to use as well as easy to deploy. The pipeline helps to improve and standardize AI algorithms in the digital health space by generating high quality synthetic image data that is not linked to specific patients. Its image generation capabilities include the ability to generate imaging of pathologies or injuries with low incidence rates. This improvement of digital health AI algorithms could improve diagnostic accuracy, aid in patient care, decrease medicolegal claims, and ultimately decrease the overall cost of healthcare. The pipeline builds on existing Generative Adversarial Networks (GANs) algorithms, and preprocessing and evaluation steps were included for completeness. For this work, we focused on ensuring the pipeline supports radiography, with a focus on synthetic knee and elbow x-ray images. In designing the pipeline, we evaluated the performance of current GAN architectures, studying the performance on available x-ray data. We show that the pipeline is capable of generating high quality and clinically relevant images based on a lay person's evaluation and the Fr\'echet Inception Distance (FID) metric.
Final Report on MITRE Evaluations for the DARPA Big Mechanism Program
Peterson, Matthew, Korves, Tonia, Garay, Christopher, Kozierok, Robyn, Hirschman, Lynette
This report presents the evaluation approach developed for the DARPA Big Mechanism program, which aimed at developing computer systems that will read research papers, integrate the information into a computer model of cancer mechanisms, and frame new hypotheses. We employed an iterative, incremental approach to the evaluation of the three phases of the program. In Phase I, we evaluated the ability of system and human teams ability to read-with-a-model to capture mechanistic information from the biomedical literature, integrated with information from expert curated biological databases. In Phase II we evaluated the ability of systems to assemble fragments of information into a mechanistic model. The Phase III evaluation focused on the ability of systems to provide explanations of experimental observations based on models assembled (largely automatically) by the Big Mechanism process. The evaluation for each phase built on earlier evaluations and guided developers towards creating capabilities for the new phase. The report describes our approach, including innovations such as a reference set (a curated data set limited to major findings of each paper) to assess the accuracy of systems in extracting mechanistic findings in the absence of a gold standard, and a method to evaluate model-based explanations of experimental data. Results of the evaluation and supporting materials are included in the appendices.
Helicopter Track Identification with Autoencoder
Wang, Liya, Lucic, Panta, Campbell, Keith, Wanke, Craig
Computing power, big data, and advancement of algorithms have led to a renewed interest in artificial intelligence (AI), especially in deep learning (DL). The success of DL largely lies on data representation because different representations can indicate to a degree the different explanatory factors of variation behind the data. In the last few year, the most successful story in DL is supervised learning. However, to apply supervised learning, one challenge is that data labels are expensive to get, noisy, or only partially available. With consideration that we human beings learn in an unsupervised way; self-supervised learning methods have garnered a lot of attention recently. A dominant force in self-supervised learning is the autoencoder, which has multiple uses (e.g., data representation, anomaly detection, denoise). This research explored the application of an autoencoder to learn effective data representation of helicopter flight track data, and then to support helicopter track identification. Our testing results are promising. For example, at Phoenix Deer Valley (DVT) airport, where 70% of recorded flight tracks have missing aircraft types, the autoencoder can help to identify twenty-two times more helicopters than otherwise detectable using rule-based methods; for Grand Canyon West Airport (1G4) airport, the autoencoder can identify thirteen times more helicopters than a current rule-based approach. Our approach can also identify mislabeled aircraft types in the flight track data and find true types for records with pseudo aircraft type labels such as HELO. With improved labelling, studies using these data sets can produce more reliable results.
Autoencoding Features for Aviation Machine Learning Problems
Wang, Liya, Lucic, Panta, Campbell, Keith, Wanke, Craig
The current practice of manually processing features for high-dimensional and heterogeneous aviation data is labor-intensive, does not scale well to new problems, and is prone to information loss, affecting the effectiveness and maintainability of machine learning (ML) procedures. This research explored an unsupervised learning method, autoencoder, to extract effective features for aviation machine learning problems. The study explored variants of autoencoders with the aim of forcing the learned representations of the input to assume useful properties. A flight track anomaly detection autoencoder was developed to demonstrate the versatility of the technique. The research results show that the autoencoder can not only automatically extract effective features for the flight track data, but also efficiently deep clean data, thereby reducing the workload of data scientists. Moreover, the research leveraged transfer learning to efficiently train models for multiple airports. Transfer learning can reduce model training times from days to hours, as well as improving model performance. The developed applications and techniques are shared with the whole aviation community to improve effectiveness of ongoing and future machine learning studies.
A Model-Based, Decision-Theoretic Perspective on Automated Cyber Response
Booker, Lashon B., Musman, Scott A.
Cyber-attacks can occur at machine speeds that are far too fast for human-in-the-loop (or sometimes on-the-loop) decision making to be a viable option. Although human inputs are still important, a defensive Artificial Intelligence (AI) system must have considerable autonomy in these circumstances. When the AI system is model-based, its behavior responses can be aligned with risk-aware cost/benefit tradeoffs that are defined by user-supplied preferences that capture the key aspects of how human operators understand the system, the adversary and the mission. This paper describes an approach to automated cyber response that is designed along these lines. We combine a simulation of the system to be defended with an anytime online planner to solve cyber defense problems characterized as partially observable Markov decision problems (POMDPs).